Engineering Features from Raw Sensor Data to Analyse Player Movements during Competition

Research in field sports often involves analysis of running performance profiles of players during competitive games with individual, per-position, and time-related descriptive statistics. Data are acquired through wearable technologies, which generally capture simple data points, which in the case of many team-based sports are times, latitudes, and longitudes. While the data capture is simple and in relatively high volumes, the raw data are unsuited to any form of analysis or machine learning functions. The main goal of this research is to develop a multistep feature engineering framework that delivers the transformation of sequential data into feature sets more suited to machine learning applications.


Introduction
Technological advancements have made it possible to measure positions, motion, and inertial forces related to human movements with several types of new instruments within the field of wearable sensors.Due to their low cost, cutting-edge technology, and easy portability, wearable sensors have been widely used in the last decade to assess and analyse physical activity [1][2][3], game demands [4], and external loads in sports analytics [5][6][7].The capability of wearable devices for assessing and measuring human physical performances enables them to become a fixture in the near future of sports science.
In the case of GPS technology, these devices transmit the time of orbiting satellites synchronised with GPS receivers on Earth using an atomic clock.By exploiting the signals that the GPS sensor on Earth receives from at least four GPS-orbiting satellites, it is possible to compute the position of the GPS receiver [8].The validity and reliability of the GPS sensor have been already assessed [9][10][11].The data collected from the micro wearable devices record the physical load expenditure of athletes during training sessions and games and can be harnessed to assess players' performances [12], prevent injuries [13,14], quantify match running performances [15], and optimise rehabilitation training [16].
This study uses data collected during games of Gaelic Football (GF), a sport that originated in Ireland, with unique rules that make it a combination between rugby and soccer.Players can carry the ball in their hands but must bounce it to the ground or tap it with their foot every four steps.It is possible to pass the ball with both hands and feet.The goalpost is similar to rugby, and it is possible to score above the crossbar (one point) or below it (three points).GF games are played by two teams of 15 amateur players.The size of the pitch is about 130-150 m in length and 80-90 m in width.GF can be played at the club (local) or county (regional) levels.County games last 70 min, with two halves of 35 min each.

Problem Statement and Contribution
Studies on Gaelic Football aim to furnish comprehensive running performance profiles for players engaged in competitive matches.This is enabled by wearable technologies, which automate the capture of high volumes of information during most or all sporting activities, such as training and matches.Despite the fact that this form of data harvesting has been widespread for over a decade, the application of machine learning and deep learning techniques is still in the embryonic stage.More complex analyses, such as the prediction of players' running performances, is not currently addressed.To date, player running profiles have primarily been established through retrospective analysis of historical data.Recent advancements in machine learning techniques have unlocked the potential for forecasting forthcoming events and generating insights.However, automated data acquisition as typically offered by GPS devices delivers data purely on the basis of the location at a specific time, albeit at a very granular level.While some manufacturers will offer to transform this into a more usable format (effort in terms of speed and distance), it is not necessarily at a level suited to machine learning functions and, in some cases, suffers from information loss due to the highly abstract nature of the report data [17].
In this research, we present a multistep framework to manipulate raw sensor values with the objective of creating a feature set suited to predicting further actions or events during game time analysis.In effect, it offers a more data-centric approach to decision making in sports coaching and management.As part of this process, we articulate the contribution of this work as follows: • A methodology with a parameterisable ruleset for feature extraction from raw sensor data.In effect, a per-second stream of locations is converted to a sequence of actions.

•
An algorithm to convert a sequence of actions into an event where the event captures movement from (close to) rest and a return to the same resting state.

•
The application of existing research, covering typical in-game player workloads, to offer a comparative validation for our feature set creation.As part of this validation, a dimensional analysis of game/player workloads is presented by game (time), player, or action.• Finally, as a proof of concept for the application of machine learning to predict player performances towards the end of the game, the first half of a selected game plus the other previous games are used to predict the number of metres covered at high speed for the second half, with a discussion as to how these machine learning functions performed over the task.

Paper Structure
Section 2 provides a discussion of past feature engineering methods on sensor data and past GF research on players' running performances.Section 3 describes the data acquisition and the multistep framework to obtain the Actions and Events datasets.Section 4.2 presents the validation of the Actions dataset, a set of statistics on the players' actions and events during gameplay, and the prediction of forthcoming events.Concluding remarks are presented in Section 6.

Feature Engineering
Data collected by wearable sensors represent a time series comprising latitude, longitude, and speed values.If properly analysed and processed, it is possible to extract valuable information and insights [18].Time series can be analysed directly, or it is possible to extract features describing trends and variances [19].A valid and effectual method for pattern recognition in time series analysis is to define the time series with respect to the distribution of data points, correlation properties, trends, and data spread [19].The engineering of the raw data for obtaining new transformed features is a common process in GPS data analytics.The researchers in [20] proposed a transportation-mode classification method based on combining feature engineering techniques and a Light Gradient Boosting Machine to discover seven kinds of transportation modes from GPS trajectory data.The original trajectories were divided into sub-trajectories (sequences of GPS data points), where each sub-trajectory represents a unique transportation mode.Then, on these sub-trajectories were extracted the following features: the distance, average velocity, expectation velocity, expectation acceleration, 95th percentile velocity, 95th percentile acceleration, minimum velocity, variance velocity, heading change rate, stop rate, and velocity change rate.Working on the same dataset, the authors in [21] used a different set of features for the classification.In addition to the speed and acceleration, they computed the jerk, bearing, and bearing rate for each data point.Therefore, for each sub-trajectory example, they extracted the minimum, maximum, mean, median, standard deviation, and 10th, 25th, 50th, 75th, and 90th percentile for each of the specific features.
A follow-up study [22] improved the prediction results by adding additional features, both statistical attributes (kurtosis, skewness, coefficient of variation, and autocorrelation coefficient) and domain knowledge (stop rate, velocity change rate, and head change rate).The new set of statistical features benefits the final prediction [22], but some of them are slightly different measures even for the same concept (for example, the standard deviation and coefficient of variation have a similar formula and record the spread of data).This can lead to, for example, multicollinearity (an approximately linear relationship between two or more independent variables) in the data, which can damage the effectiveness of the model [23].
To recognise modes of driving railway trains from GPS data, the authors in [24] derived multiple features from the speed of locomotives during routes (in addition to some domain-specific features).The set of features can be grouped into global statistical features (mean, standard deviation, mode, median, max three consecutive values, min three consecutive values, value range, interquantile range, skewness, kurtosis, coefficient of variation, autocorrelation coefficient, stop rate, velocity change rate, distance), local statistical features (mean length of each decomposition class, length standard deviation of each decomposition class, proportion of each decomposition class, change times), and timedomain and frequency-domain features (median crossover rate, number of peaks, shorttime Fourier transformation).

Players' Running Performance Research in Gaelic Football
In the last decade, sports institutions and teams developed a deep interest in datadriven approaches, aiming to support decision makers in obtaining a competitive advantage.Data obtained from wearable sensors can be exploited in several ways, from describing, planning, and monitoring external loads to injury prediction.External loads in invasion field-based team sports (IFTSs) can be described by measures of total distance covered (or in specific speed bands), accelerations, or metabolic power [25].Such metrics are the result of defined computation on data obtained via tracking systems (wearable devices) worn by the players during training sessions or official games.
The authors in [15] analysed the match running profiles of GF players.They collected data from 50 male GF players using a 4 Hz GPS sensor across 30 competitive games with a total of 212 full-game datasets collected.The variables analysed were the following: high-speed distance, sprint distance, mean velocity, peak velocity, and number of accelerations.The average match distance covered by the players was 8160 ± 1482 meters m with 1731 ± 659 m covered at high speed.The sprint distance was 445 ± 169 m with 44 sprint actions on average per match.The peak velocity was 8.4 ± 0.5 m/s with a mean velocity of 1.8 ± 0.3 m/s.The number of average accelerations completed per player was 184 ± 40.Significant differences between positions were found for total running distance, high-speed running distance, and sprint distance, with midfielder being the position most demanding for the variables total and high-speed running distance.The researchers found a significant reduction in the high-speed and sprint distance between the first and second half.
In later work by the same authors [26], they aimed to identify the position and duration of the running performance of GF players using a moving average window.The study was performed on 35 players across 32 competitive matches analysed with 300 full-match play data samples using a 4 Hz GPS sensor.Players were grouped based on their position: full-back, half-back, midfield, half-forward, and full-forward.The researchers tested ten different lengths (1 to 10 min) for the moving average to analyse the recorded speeds.
The differences in running performance between GF positions have been investigated by [28].The running performance is dependent on the position on the pitch (p < 0.001).Midfielders cover a greater amount of total and relative distances compared to the other positions.Half-backs and half-forwards ran greater total distances compared to full-forward and full-backs.Midfielders were also observed to cover a greater amount of distance at high speeds compared to all other positions, while half-backs and half-forwards traversed distances at high speeds more than the other remaining position groups.Sprinting distances were covered more by half-forwards, followed by half-backs and midfielders.
Similar differences in running performance between different GF playing positions have been confirmed in [29].The authors found that the midfielders covered a significantly greater distance than defenders and that the frequency of jogging, cruise running, striding, and walking was greater in midfielders rather than in the forward positions [29].
The study [30] reported that GF players perform 166 ± 41 accelerations per game.The high-speed running distance and very high speed running distance were reported to be equal to 1563 ± 605 and 524 ± 190 m, respectively.The average distance covered during acceleration is 267 ± 45 m, distributed at 12 ± 5 accelerations per 5 m epoch.The maximum distance covered in acceleration is 296 ± 134 m.

The Concept of Movement Event in Sport Science Research
The concept of the movement event used in this research is based on the study [31], which designed a new approach to detect recurrent movements in sports by analysing positional data.Speed, acceleration, and angular velocity values were obtained from sequential sequences of positional data and clustered with the k-means algorithm, in order to create subgroups of values.Each action is composed of three values for the three dimensions.Sequences of actions were compared, assessing similarities using the longest common sub-string algorithm [31].
The authors in [32] proposed a framework for extracting movements from sequential GPS data.Similar to [31,32], they combined the speed, acceleration, and direction of the movement (turning angle) for defining a player movement description, utilising thresholds for each movement descriptor in order to assign individual labels (letters).Each sequence of movements starts and ends when a player is at a speed less than or equal to 1.2 m/s.The researchers in [32] compute distances among movement descriptors using the Levenshtein distance, which computes the number of required steps to transform one list of strings into another.
Based on the concept of an event as a sequence of actions between resting states defined in [31,32], this research satisfies the necessity of identifying profiles of events.Events are analysed using frequencies, the number of actions, the average and max duration, and the average and max distance covered.
Summary.Table 1 summarizes the key insights and contributions from past research relevant to our study.12 elite international-level female netball during four competitive games using a Wireless Ad Hoc System for Positioning (WASP) Method to discover the frequently recurring movement sequences across playing positions during matches using radio frequency data White et all.[32] 13 professional male rugby players during one Super League match using a 10 Hz GPS sensor Framework to identify sequential movement sequences using GPS-derived spatiotemporal data in team sports We build on each of the three areas in the research described above, as we have developed a novel multistep framework, designed for the purpose of feature engineering raw sensor data into multi-granular datasets, for the purpose of exploiting machine learning functions to predict future events during game time.To the best of the authors' knowledge, this framework represents a pioneering approach directed towards the acquisition and manipulation of features, to enable both machine learning applications and the analysis of actions and events utilising sensor data within the field of sport analytics.In addition, predictive modelling techniques are employed to forecast the high-speed distance of forthcoming events during the second half of competitive games based on the acquired data from the first half and the previous games.

Enrichment of Raw Sensor Data
In this section, we present a multistep method illustrated in Figure 1, which transforms raw sensor data into a series of higher-level events, which are more easily analysable.Our method is based on the concepts of actions and events, which deliver data more suited to rich forms of analysis and a broad range of machine learning functions.In summary, for each second of the game, speeds are converted to one of six action labels: 'standing', 'walking', 'jogging', 'running', 'high-intensity running', and 'sprinting', according to velocity thresholds widely accepted and defined in the literature [34].An event is considered to be a collection of sequential actions, bookended by either 'standing' or 'walking' actions.In summary, players are regarded as being relatively static before commencing into some form of movement or sequence of actions.This feature engineering process is underpinned by a set of rules or logic taken from domain experts and the existing literature.The values shown in Table 2 represent six activity zones, which are standard across many sports, making this feature extraction process widely generic.Furthermore, the adjustment of the parameters or values within zones enables a customisation across sports, gender, and age.

Data Acquisition
The first step shown in Figure 1 is data acquisition, which refers to the generation of data on devices strapped to players and the extraction of that data from each device.
For this study, data were collected during 11 competitive GF inter-county games throughout the seasons 2019-2021.The players wore a micro GPS sensor device (STATSports Apex 10 Hz, Northern Ireland, UK), placed on the upper back, within a vest that fitted the body comfortably.Previous research assessed the accuracy and reliability of the specific sensor involved [11].The study claims that the 10 Hz Apex unit reported a small error of around 1-2% of the distances measured in the experiments and that the unit could be used with confidence to measure distance variables during training and match play [11].
The GPS sensor records 10 observations for each second for the variables latitude, longitude, and speed (m/s).The raw data, shown in Table 3, were collected only when the athletes were actively taking part in the game and were provided already filtered by the STATSports software (version: 4.5.19).

Aggregation (Temporal Rollup)
The Data Aggregation step (which we refer to as temporal rollup) is necessary as data are sampled at too fine a level of granularity for our research purposes.Raw data files may, for example, contain data sampled at 10 Hz (10 observations per second).For the scope of this paper, a single observation per second provides a sufficient level of detail, which, in effect, is a temporal rollup to 1 s values.Thus, the input to this step is the 10 Hz dataset and the output is a dataset with one observation per second.
Our decision was to adopt the average values within each 1 s interval, as it provides a good approximation for describing movement within that timeframe.The location (latitude and longitude) is transformed to the centroid of the locations covered, while the variable speed would be the average values during the second (Table 4).

Feature Extraction
The 1 s interval data now serve as input to this second step where the goal is to further aggregate the data: in this case, a rollup to the level of action.Each data point will no longer represent 1 s of activity but an entire action, which will likely run over a number of seconds.This is carried out using a set of speed thresholds demarcating the transition from one movement type to another.

Parameter Definition
No standardised set of speed thresholds is available for IFTSs [35] to classify a player's movement.Speed thresholds suggested by [34], which are commonly used in IFTSs, were therefore adopted.Each speed value x i (i = 1, 2, . . ., n, where n is the last second of the game) recorded in the instant of time i is converted to an action using the labelled thresholds (Table 2).
In addition, the acceleration is computed by subtracting each speed value from the previous one.A sample of the dataset after this step is shown in Table 5. Algorithm 1 illustrates the DetectAction algorithm.This algorithm sequentially processes the rows of a dataset related to one of the halves of a game played by a single player.It effectively compresses similar actions into a unified row, simplifying a sequence of consecutive actions.It computes various features on the identified actions, such as the minimum and maximum values, the absolute energy (sum of squared values), the count of values above the average, the sum of all values (sum over the time series values), the length of the longest consecutive subsequence above the mean (longest strike), and the coefficient of variation ( standard deviation mean ) for the variables speed and acceleration.Additionally, the number of accelerations and decelerations are measured as the count of times the acceleration values were ≥2 and ≤−2.end if 23: end while The algorithm also captures the time when the action occurred, the start and end locations on the pitch, and the duration of the action in seconds.It is important to mention that actions are recorded with a minimum duration set at 1 s.
Algorithm 1 exclusively displays the variables of average speed, starting time, and ending time.The variables recording the starting and ending locations adhere to the rule of the starting and ending times, while the remaining variables (min, max, absolute energy, longest strike, etc., both for speed and acceleration) follow the criterion of average speed.
Algorithm 1 iterates through the N rows of a dataset (Table 5) comprising the sequence of the time, location, speed, action, and acceleration of a player during a game.At each iteration i, the action performed by a player is indicated by the variable x i and the time when the action occurs by the variable z i .
The first step of the algorithm is to initialise the required variables avg_speed, avg_-speed_list, start_time_list, end_time_list, y, and z, which will measure the average speed during the action, the list of such speeds, the time when the action begins and the time when it ends, the current stored action, and the row when the action starts, respectively (1)(2)(3)(4)(5)(6).While iterating over each row of the dataset (which has a length equal to N) (7), if the current performed action is equal to the current stored action, then it continues with the next row (8,9).If the two actions are different (10), then the average speed from the row when the action starts to the current row is computed (11) and stored in the list of the speeds (12), the starting time and ending time of the action are added to the corresponding lists (13,14), and the variables storing the row when the action starts (z) and the current action (y) are updated (15,16).If the row is the last row of the dataset (17), the algorithm computes the average speed until this row (18), it stores the value in the list of the speeds (19), and it adds the starting time and ending time of the action to the corresponding lists (20,21).
Algorithm 1 requires the data points to be consecutive on a per-second basis.It thus works on one half per time.It is suggested to apply it individually to each half of the game for each player and subsequently concatenate the results.
A sample of the resulting dataset after this step is shown in Table 6 (some of the features introduced in this section have been omitted).

Direction of Movement
In order to detect the direction of movement, the next step is the computation of the turning angle from the position at time i (beginning of the action) and the position at i + 1 (end of the action) [32] (i = 1, 2, . . ., n, where n is the last second of the game).The turning angle is the angle you would need to turn from the starting direction to the ending direction in order to reach the destination.For obtaining the turning angle, the bearing for the two consecutive points should be computed and then subtracted.The bearing is used to describe the direction or angle between two points, measured clockwise from the magnetic north.It represents the direction you would need to travel in a straight line from point A to point B, measured in degrees.There are different steps needed for computing the bearing:

1.
Compute X and Y [32]: where θ denotes the latitude, L denotes the longitude, A and B denote two consecutive decimal coordinates, and ∆L denotes the difference between L at A and L at B.

2.
Compute the bearing β [32]: The turning angle τ is obtained by subtracting the bearing computed on two successive time points [32]: The numerical turning angle values are expressed as degrees from 0 • to 360 • and are then converted to a label (forward, backward, right, and left) expressing the direction of movement (Table 7)

Event Detection
As per our earlier definition in Section 3.2, an event is a sequence of actions that occur between two resting states 'standing' or 'walking'.Therefore, an event commences when the player transitions out of a 'standing' or 'walking' state (speed ≤ 2 m/s) to performing actions at higher speeds 'jogging', 'running', etc. and finally returns to a 'standing' or 'walking' movement (Figure 2).Each event can be of different lengths and compositions but all start and end with 'standing' or 'walking' unless some unexpected interruption occurs.This definition of an event is based on past research [31,32].The DetectEvent algorithm (Algorithm 2) iterates through the N rows of the dataset corresponding to one half of a player's game (Table 6) to identify actions associated with the same event.current_state ← 0 17: end while Algorithm 2 begins with the initialisation of variables: y s , y w , events_list, id_event, and current_state, representing a variable equal to 'standing', a variable equal to 'walking', the list of the events ids, the current event id, and the current state, respectively (1-5).Current state is an indicator that contains 0 the first time the algorithm meets an action equal to 'standing' or 'walking' or 1 if it has already met one.While iterating over the dataset (6), if the current action x_i is equal to 'standing' or 'walking' (7), two situations can occur.If the current state is equal to 1 (8), a new event starts: the id_event is increased by 1 (9), and the new value is added to the list of events (10).On the contrary, if the current state is equal to 0 (11)), add the id_event to the list of the events (12) and switch the value of the variable current state to 1 (13).If the current action x_i is different from 'standing' or 'walking' (14), add the current id_event to the list of the events (15) and set the current state to 0 (16).
At the end of this step, each action is assigned to an event with a unique identifying label (Table 8).

P.ID Start
Sec.

End
Sec.

Start
Lat.

Start
Lon.

End
Lat.The final step in constructing the Events dataset involves the aggregation of actions and action features into events.The Actions dataset is aggregated into the Events dataset by grouping rows sharing the same event ID.Each row of the final Events dataset represents an event performed by a player in a game.

End
Table 9 displays a sample of the resulting Events dataset.'player ID' indicates the ID of the player; the 'event ID' indicates the ID of the event; 'start_second' and 'end_second' indicate the starting and ending seconds from the beginning to the game 'start_time' and 'end_time', expressed in Table 6 as a specific time in the format HH:MM:SS, have here been converted); 'start_lat', 'start_lon', 'end_lat', and 'end_lon' indicate the starting and ending locations of the event; 'avg speed' and 'std speed' indicate the average and the standard deviation of the speed of the actions composing the event; 'duration' indicates the duration of the event; 'std duration' indicates the standard deviation of the duration of the actions composing the event; 'distance' indicates the distance covered during the event; high-speed distance' indicates the distance covered at speed ≥5.5 m/s during the event; and 'unique turning angles' indicate the number of different directions of movement during the event.
The average and standard deviation has been calculated for the statistical features derived from the speed and acceleration values of each action and presented in Section 3.3.1.These features include the minimum and maximum values, absolute energy, count of values above the average, sum of all values, length of the longest strike, and coefficient of variation.Furthermore, the total count of accelerations and decelerations has been calculated for each event.
Not all the collected features have been displayed in Table 9 due to space reasons.

Statistical Summaries and Feature Set Validation
To validate the framework, metrics previously adopted in GF research are applied to the Actions dataset, aiming to assess whether findings in this research align with those from prior studies.Then, a statistical analysis of the data comprising the constructed dataset is provided.Speeds, distances, actions, and events are analysed by halves or matches in order to show trends, patterns, similarities, and differences.
Several statistical t-tests on the variables have been conducted to investigate differences between the first and second halves of games.Finally, three predictive models are tested on the Events dataset to forecast the high-speed distance covered by players during the second half of a selected game, based on the first half and the remaining games.

Actions Dataset Evaluation
In order to verify that the Actions dataset is a true reflection of the actual game, it would be necessary to watch and record the movements of all 15 players in real time, a process that is both impractical and would not scale.Instead, our decision was to compare the dataset with the existing literature to ensure that our feature engineering process delivers a dataset that is in line with existing analyses in terms of the overall volumes and intensity of games.In this respect, we investigated if running profiles are coherent with previous findings in GF research using the in-game profile of elite male Gaelic Football in [15].The authors identified an average distance covered of 8160.11± 1482.02 m, with 1731.29 ± 659.76 m covered at speed 4.72 m/s and 445 ± 169 m at speed 6.11 m/s.Similar values are observed in the Actions dataset: the average distance identified is 8633.8± 1573.6 m, with 1453.6 ± 552.7 m covered at high speeds of 4.72 m/s and 503.5 ± 205.1 m at a sprinting speed of 6.11 m/s.Consistent outcomes were observed even for both the average and peak speeds: 1.8 ± 0.3 m/s and 8.4 ± 0.5 m/s, respectively, in [15] and 2.8 ± 1.6 m/s and 8.4 ± 0.5 m/s in this current research.
In a separate study conducted using the same data [36], the authors observed a reduction in the distance covered between the first and other quarters of the game: −4.1% with the second, −5.9% with the third, and −3.8% with the fourth.They observed a reduction in high-speed running distance (≥4.72 m/s) in the second (−8.8%), third (−15.9%), and fourth (−19.8%)quarters when compared to the first quarter.This research found, in part, a similar reduction in the distance covered between the first and other quarters of the game: −6.6% with the second, −6.6% with the third, and −15.8% with the fourth.When comparing the high-speed running distance between the first and other quarters, the reductions were equal to −7.2% (second), −11.3% (third), and −17.1% (fourth).
The previous comparisons reveal a consistent analogy between this research's findings with those obtained by prior GF research.The marginal differences identified may be attributed to the different sample sizes or players' fitness levels.

Games: Half by Half Analysis
The goalkeeper has been omitted from the analysis due to the difference in physical demands.The average speeds during the first and second halves are 2.84 ± 1.62 m/s and 2.73 ± 1.61 m/s, respectively.The t-test reports a statistically significant difference between the two means (p-value: 1.89 × 10 −12 ).
The count, average, and standard deviation of the duration and maximum duration of the aggregated actions are shown in Table 10.The mean count of actions for each game is 3945 ± 286.The average duration of the actions is 2.9 ± 0.1 s.The average of the maximum duration of the actions is 25.2 ± 3 s.
The resulting p-value for the t-test on the count of actions is 0.004 (lower than the chosen level of significance, 0.05), showing a statistically significant difference for the number of actions in the two halves of the game, which is always higher in the first half.

Analysis of Extracted Actions
An analysis of all labelled actions is presented in Table 11.The average number of actions per game decreases as the speed increases: 'jogging' (4400.5 ± 226.5), 'running' (2402.5 ± 138.9), 'high-intensity running' (827.9 ± 54.3), and 'sprint' (155.1 ± 14.6).A t-test has indicated the presence of a statistically significant difference in the mean duration of 'high-intensity running' actions between the first half (2.0 ± 0.1 s) and the second half (2.2 ± 0.1 s).In reference to the other actions, the mean duration of 'jogging', 'running' and 'sprint' actions are 3.4 ± 0.1, 2.3 ± 0.1, and 2.1 ± 0.2 s, respectively.
A statistically significant difference has been identified through a t-test for the mean distance per 'high-intensity running' action between the first and second halves (12.4 ± 8.7 and 13.1 ± 9.6 m).
The remaining actions revealed similar outcomes when comparing the two halves.

Analysis of Events
Events composed only of the actions 'standing', 'walking', and 'standing' and 'walking' are removed from the current analysis.
The average number of events during the first and second halves are 1657.3± 106.7 and 1538.2 ± 94.9, respectively.
The highest number of events during the first half is 1937, with 7.1 ± 6.1 s of average duration and 63 s of maximum duration for an event (Table 12).The highest for the second half is 1669 with 7.1 ± 6.2 s of average duration and 72 s of maximum duration for an event.
The average number of actions per event is 2.55 ± 2.25, with 1 as the minimum and 21 as the maximum.

Case Study: Predicting High-Speed Actions
While the main focus of this work is on engineering feature sets from the raw sensor data, it is useful to illustrate a case study of the types of possible analyses using the new feature set.Here, we run a number of experiments to predict the high-speed distance covered by players during games, comparing the results obtained by three different machine learning models.
Knowing this metric in advance can be useful for sports coaches to manage players' time on the pitch, for example, to avoid fatigue or even injuries.The challenge is to predict, for each player, the high-speed distance covered per event during the second half of a target game.Then, these predicted distances are summed to obtain the predicted total high-speed distance per player.
The dataset used to complete this task is a transformation of the Event dataset (Table 9).For each player, a sequence of 30 consecutive high-speed distance values is used as a new set of features, and the variable to predict is the high-speed distance at the 31st position (Table 13).To replicate a real-world scenario, each model predicts the high-speed distance values for the second half of the last game in a sequence of games, using training using data from the first half of the same game and all prior games in the sequence (in this case study, 10 previous games).The time series data of high-speed distance for the remaining games plus the first half of the tested game are merged to form a unique time series.The first half of the tested game is placed at the end of this time series, as shown in Figure 6.These data constitute the training set for the predictive experiments.While the learning phase uses data from prior games together with the first half of the target game, during the testing phase, the test set is the model's predictions.This approach is known as recursive forecasting.In this method, the model is trained on historical data, and rather than having a separate test set, it uses its predictions as inputs for future predictions.This experiment simulates the real scenario of the second half distances being forecast during the half-time break based on data from the first half and the previous games.
Assuming N as the number of events performed by player i during the first half, validation can thus be described as follows: 1.
The sequence composed of the last 30 high-speed distance values of the first half is used as input (test set) by the model to generate the first prediction.

2.
The sequence composed of the last 29 high-speed distance values of the first half plus the first prediction of the second half is used as input by the model to generate the second prediction.

3.
Finally, the Nth prediction uses the previous N − 1 predictions.If N − 1 < 30, the remaining inputs are taken from the final high-speed values in the first half.
At the start of the second half, the number of player events for the half is unknown, and for the purpose of the experiment, we assume this number to be equal to the number of events in the first half.Thus, the recursive prediction process is repeated as many times as the number of events performed by the player during the first half.
It was decided to use customised models per player (as opposed to one model for all player positions) by training and testing players separately.The players involved are those who played during the first half of the tested game and played in at least one of the previous games.The set of features used is the same for each model.Three separate models were applied to solve the predictive task: 1.
The LSTM and CNN networks have been implemented in Keras [42].The predicted event-by-event high-speed distances are summed for each player.In this way, it is possible to obtain the total predicted high-speed distance for the second half.Table 14 compares the predicted value with the true distances covered at high speed for each player.Only those players who played for the entire first and second halves are included.The goalkeeper has been removed from the analysis.
The mean absolute error (MAE) measures the average absolute difference between the predicted values and the actual target values and was used as it is less sensitive to outliers than the root mean squared error.For each model, the MAE between the predicted and the true high-speed distances, displayed in Table 14, has been computed.The MAE is equal to 141.7, 86.2, and 116.4 m for the XGBoost, LSTM, and CNN models, respectively.In short, the LSTM model, to predict the total distance at high speed covered during the second half of the tested game, makes, on average, an error of 86.2 m per player.The average distance at high speed covered by the 10 analysed players during the second half of the game was 358.4 m.On average, the LSTM model makes a percentage error of 24.1% on the total real high-speed distance.The calculation of the MAE for each event is made impossible by the assumption of an equal number of events between the first and second halves.Indeed, the number of events in the second half (y test) is different from the number of events in the first half (y pred) (Table 14).
We should make clear the limitations of this particular case study.The high-speed distance in the second half could be estimated by fitness coaches by combining the average speed and the peaks of the speed of players.However, our goal is to demonstrate the possibilities using this novel feature set.As this is the first attempt to combine GPS data and machine learning models to predict future high-speed distances, it represents an initial baseline for future comparisons in this task.

Conclusions
The application of wearable sensor data to analyse the movements, actions, and positions of players for decision support is underused in terms of the potential exploitation of data.The existing literature describes relatively conservative approaches in how wearable sensor data are processed and applied.To the best of the authors' knowledge, this is the first framework-based approach to process raw GPS data into a feature set suitable for supervised and unsupervised machine learning tasks in the sporting domain.The overall process comprises a number of distinct steps.A temporal rollup to 1 s values is performed by computing the centroid of the locations and the average speed in the 10 observations per second.The average speed is subsequently converted to action labels, which are well understood in the literature.Consecutive rows showing the same actions are aggregated, and several features are computed to form the Actions dataset.The DetectEvent algorithm uses the Actions dataset to group actions into a single event by collecting summary features, to build the Events dataset.
The Actions dataset was validated by comparison with metrics presented in the related literature.A machine learning case study was developed using the Events dataset to show one potential application of the feature engineering methodology.The problem of predicting the high-speed distance covered by players in the second half of a game based on past games and the first half is a new idea, and while this represents a relatively simple study, it represents a baseline for future improvements.
It is important to highlight that this work provides a relatively small case study demonstrating the effectiveness and comparative capabilities of machine learning models.Future research is deliberately planned to enhance the selection, training, and evaluation of machine learning models to ensure the accuracy and stability of the models.Using the dataset created by the work presented here, we are currently focusing on graph-based machine learning for predicting high-speed and distance events late in games; ensemble learning models to predict the future sequence of actions; and anomaly detection functions to detect unusual event sequences within games.

Figure 2 .
Figure 2. Event identification from the raw sensor data.

Figures 3 -
show the frequency of values for the features 'duration', 'distance', and 'high-speed distance'.The shape of the frequency distribution is similar for the three investigated features, characterised by elevated frequencies at lower values and a visible decline as values increase.

Figure 3 .
Figure 3. Left: Frequency of events per duration.Right: Frequency of events per duration ≥ 30 s.

Figure 4 .
Figure 4. Left: Frequency of events per distance.Right: Frequency of events per distance ≥ 130 m.

Figure 5 .
Figure 5. Left: Frequency of events per high-speed distance.Right: Frequency of events per high-speed distance ≥ 35 m.

Figure 6 .
Figure 6.The time series of high-speed distance for the remaining games are merged to form a unique time series.The first half of the tested game is placed in the last position.These data constitute the training set for the predictive experiments.

Table 1 .
Summary table presenting the main contributions of related research.
[26]arison of the metabolic power demands between positional groups and examination of the temporal profile of elite GF match play Cullen et al.[29]85 male U-18 GF players during 17 competitive games using a 10 Hz GPS sensor Evaluation of the physiological profiles and activity patterns in club-and county-level under-18 GF players relative to playing position Ryan et al.[30]36 male GF players during 19 competitive games using a 4 Hz GPS sensor Acceleration profile of elite GF match play Gamble et al.[27]36 male GF players during five games using a 10 Hz GPS sensor Running profiles of GF players for playing position and evaluation of the trend of physical performance during the games Malone et al.[26]35 male GF players during 32 competitive games Position-and duration-specific running performance of GF playersSweeting et al. [31]

Table 2 .
Speed zones and parameters.

Table 3 .
Sample of raw data recorded by STATSports Apex 10 Hz sensor.The data shown have been made by the authors to resemble the original data.The names of some columns in the header have been shortened: latitude (Lat.) and longitude (Lon.).

Table 4 .
Sample of the aggregated data.The names of some columns in the header have been shortened: latitude (Lat.) and longitude (Lon.).

Table 5 .
Sample of the aggregated data with labelled actions.The names of some columns in the header have been shortened: latitude (Lat.) and longitude (Lon.).

Table 6 .
Sample of the aggregated data after the compression of similar actions.The names of some columns in the header have been shortened: start latitude (Start Lat.), start longitude (Start Lon.), end latitude (End Lat.), and end longitude (End Lon.).

Table 7 .
Direction of movement expressed as labelled turning angles.

Table 9 .
Schema of the Events dataset at the end of the multistep framework.The names of some columns in the header have been shortened: player ID (P. ID), event ID (E.ID), start second (Start Sec.), end second (End Sec.), avg speed (Avg Sp.), std speed (Std Sp.), duration (Dur.), standard deviation duration (Std Dur.), distance (Dis.), high-speed distance (High-Speed Dist.), and unique turning angle (U.T.A.).

Table 10 .
Half by half analysis: action count, average, and maximum duration, sorted by count of actions.

Table 11 .
Count, average, and maximum duration and distance covered for each type of action during the sampled games, sorted by average distance.

Table 12 .
Count of events, actions, average number and duration of actions in an event, and maximum number and duration of actions in an event per half of the investigated games, sorted by count of events.

Table 13 .
Structure of the transformed dataset used for the prediction of high-speed distance (H.S.D.) covered by players during the second half of a game.The target feature to predict is the variable 31st H.S.D located at the end of the sequence.

Table 14 .
Player by player predictions and true values for the total distance at high speed during the second half of a randomly tested game using XGboost, LSTM, and CNN models defined in Section 5.The values presented in the tables are expressed in metres.